Computer Assisted Topic Classification

Authors

  • Dustin Hillard
  • Stephen Purpura
  • John Wilkerson
Abstract

Running Head: COMPUTER ASSISTED TOPIC CLASSIFICATION

PRE-PUBLICATION VERSION. Cite as JITP 4:4, Forthcoming. There are a few known changes to the colors of the graphs, and there may be other editorial changes as suggested by the editors.

Social scientists interested in mixed methods research have traditionally turned to human annotators to classify the documents or events used in their analyses. The rapid growth of digitized government documents in recent years presents new opportunities for research but also new challenges. With more and more data coming online, relying on human annotators becomes prohibitively expensive for many tasks. For researchers interested in saving time and money while maintaining confidence in their results, we show how a particular supervised learning system can provide estimates of the class of each document (or event). This system maintains high classification accuracy and provides accurate estimates of document proportions, while achieving reliability levels associated with human efforts. We estimate that it lowers the costs of classifying large numbers of complex documents by 80% or more.

Technological advances are making vast amounts of data on government activity newly available, but often in formats that are of limited value to researchers as well as citizens. In this paper, we investigate one approach to transforming these data into useful information. "Topic classification" refers to the process of assigning individual documents (or parts of documents) to a limited set of categories. It is widely used to facilitate search as well as the study of patterns and trends. To pick an example of interest to political scientists, a user of the Library of Congress' THOMAS website (http://thomas.loc.gov) can use its Legislative Indexing Vocabulary (LIV) to search for congressional legislation on a given topic.
Similarly, a user of a commercial Internet service turns to a topic classification system when searching, for example, Yahoo! Flickr for photos of cars or Yahoo! Personals for postings by men seeking women. Compared to less selective keyword-based approaches, topic classification is valued for its ability to limit search results to documents that closely match the user's interests. However, a central drawback of these systems is their high cost. Humans, who must be trained and supervised, traditionally do the labeling. Although human annotators become somewhat more efficient with time and experience, the marginal cost of coding each document does not really decline as the scope of the project expands. This has led many researchers …

Similar Resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Segmentation Assisted Object Distinction for Direct Volume Rendering

Ray Casting is a direct volume rendering technique for visualizing 3D arrays of sampled data. It has vital applications in medical and biological imaging. Nevertheless, it is inherently open to cluttered classification results. It suffers from overlapping transfer function values and lacks a sufficiently powerful voxel parsing mechanism for object distinction. In this work, we are proposing an ...

Two New Methods of Boundary Correction for Classifying Textural Images

With the growth of technology, supervising systems are increasingly replacing humans in military, transportation, medical, spatial, and other industries. Among these systems are machine vision systems which are based on image processing and analysis. One of the important tasks of image processing is classification of images into desirable categories for the identification of objects or their sp...

The Comparison of Computer Assisted Teaching and Traditional Explicit Method in Learning / Teaching English Vocabulary.

This review surveys research on second language vocabulary teaching and learning since 1999. It first considers the distinction between incidental and intentional vocabulary learning. Although learners certainly acquire word knowledge incidentally while engaged in various language learning activities, more direct and systematic study of vocabulary is also required. There is a discussion of how word...

Review of “Computer-Assisted and Web-Based Innovations in Psychology, Special Education, and Health” edited by James K. Luiselli & Aaron J. Fischer

Computer-Assisted and Web-Based Innovations in Psychology, Special Education, and Health edited by James K. Luiselli & Aaron J. Fischer. London & San Diego: Academic Press, 2016. 408pp., $74.95 (hardcover), ISBN 9780128020753

Publication date: 2008